¡Buen día Juan! Como te va? Espero que todo vaya bien!

Mi nombre es Emanuel Fitta! Un gusto conocerte, seré tu revisor en este proyecto.

A continuación un poco sobre la modalidad de revisión que usaremos:

Cuando enccuentro un error por primera vez, simplemente lo señalaré, te dejaré encontrarlo y arreglarlo tú cuenta. Además, a lo largo del texto iré haciendo algunas observaciones sobre mejora en tu código y también haré comentarios sobre tus percepciones sobre el tema. Pero si aún no puedes realizar esta tarea, te daré una pista más precisa en la próxima iteración y también algunos ejemplos prácticos. Estaré abierto a comentarios y discusiones sobre el tema.

Encontrará mis comentarios a continuación: no los mueva, modifique ni elimine.

Puedes encontrar mis comentarios en cuadros verdes, amarillos o rojos como este:

Comentario del revisor.

Exito. Todo se ha hecho de forma exitosa.

Comentario del revisor.

Observación. Algunas recomendaciones.

Comentario del revisor.

Necesita arreglos. Este apartado necesita algunas correcciones. El trabajo no puede ser aceptado con comentarios rojos.

Puedes responder utilizando esto:

Respuesta de estudiante.
In [ ]:
 
In [ ]:
 
Review General. (Iteración 1)

Muy bien Juan! Quiero felicitarte, el trabajo que has realizado ha sido sumamente completo. Tienes buenas prácticas en tu codigo, haces buen uso de agrupaciones y funciones de transformación y limpieza de datos. Los gráficos están bien realizados y son muy claros. Realmente te felicito mucho por este gran trabajo, si continuas así no tengo duda alguna de que llegarás a ser una gran gran analista de datos. Te felicito, por favor sigue así!

Te deseo mucho éxito!

Project 7¶

Step 1: Open file and survey¶

Import liberies

In [1]:
import pandas as pd
import numpy as np
from scipy import stats as st
In [2]:
# Read dataset.

df = pd.read_csv('/datasets/games.csv')
Comentario del revisor. (Iteración 1)

Veo que tienes el notebook con un error y creo que se debe al folder donde está guardado el archivo. No parece ser grave pues el resto del notebook está bien.

Data exploration

In [3]:
df.head()
Out[3]:
Name Platform Year_of_Release Genre NA_sales EU_sales JP_sales Other_sales Critic_Score User_Score Rating
0 Wii Sports Wii 2006.0 Sports 41.36 28.96 3.77 8.45 76.0 8 E
1 Super Mario Bros. NES 1985.0 Platform 29.08 3.58 6.81 0.77 NaN NaN NaN
2 Mario Kart Wii Wii 2008.0 Racing 15.68 12.76 3.79 3.29 82.0 8.3 E
3 Wii Sports Resort Wii 2009.0 Sports 15.61 10.93 3.28 2.95 80.0 8 E
4 Pokemon Red/Pokemon Blue GB 1996.0 Role-Playing 11.27 8.89 10.22 1.00 NaN NaN NaN
In [4]:
# I found missing data on 'Critic_Score', 'User_Score', 'Rating'; Wrong Dtype in 'Year_of_Release'.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16713 non-null  object 
 1   Platform         16715 non-null  object 
 2   Year_of_Release  16446 non-null  float64
 3   Genre            16713 non-null  object 
 4   NA_sales         16715 non-null  float64
 5   EU_sales         16715 non-null  float64
 6   JP_sales         16715 non-null  float64
 7   Other_sales      16715 non-null  float64
 8   Critic_Score     8137 non-null   float64
 9   User_Score       10014 non-null  object 
 10  Rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB

Step 2: Data cleaning¶

Replace column names¶
In [5]:
# Lowered name columns.

df.columns = df.columns.str.lower()
df.columns
Out[5]:
Index(['name', 'platform', 'year_of_release', 'genre', 'na_sales', 'eu_sales',
       'jp_sales', 'other_sales', 'critic_score', 'user_score', 'rating'],
      dtype='object')
Comentario del revisor. (Iteración 1)

Muy bien Juan! Empezamos muy pero muy bien! utilizaste el método info, el cual nos ayuda a dar un vistazo general del dataframe. Además, veo que te adelantaste y empezaste a transformar tu dataframe cambiando los nombres de las columnas a minúsculas, lo cual es una buena práctica.

Look for duplicates¶
In [6]:
# I did not found duplicates

df['name'] = df['name'].str.lower()
df['platform'] = df['platform'].str.upper()
df['genre'] = df['genre'].str.lower()
df['rating'] = df['rating'].str.upper()
df.duplicated().sum()
Out[6]:
0
Handle missing data¶
In [7]:
# Missing Data, Im trying to understad why there are missing data, perhaps only older games have missing data.
# There are missing data in all years.

df[df["critic_score"].isna()]['year_of_release'].sort_values().unique()
Out[7]:
array([1980., 1981., 1982., 1983., 1984., 1985., 1986., 1987., 1988.,
       1989., 1990., 1991., 1992., 1993., 1994., 1995., 1996., 1997.,
       1998., 1999., 2000., 2001., 2002., 2003., 2004., 2005., 2006.,
       2007., 2008., 2009., 2010., 2011., 2012., 2013., 2014., 2015.,
       2016.,   nan])
In [8]:
df['critic_score'].sort_values().unique()
Out[8]:
array([13., 17., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29.,
       30., 31., 32., 33., 34., 35., 36., 37., 38., 39., 40., 41., 42.,
       43., 44., 45., 46., 47., 48., 49., 50., 51., 52., 53., 54., 55.,
       56., 57., 58., 59., 60., 61., 62., 63., 64., 65., 66., 67., 68.,
       69., 70., 71., 72., 73., 74., 75., 76., 77., 78., 79., 80., 81.,
       82., 83., 84., 85., 86., 87., 88., 89., 90., 91., 92., 93., 94.,
       95., 96., 97., 98., nan])
In [9]:
# name
print(df[df['name'].isna()])
      name platform  year_of_release genre  na_sales  eu_sales  jp_sales  \
659    NaN      GEN           1993.0   NaN      1.78      0.53      0.00   
14244  NaN      GEN           1993.0   NaN      0.00      0.00      0.03   

       other_sales  critic_score user_score rating  
659           0.08           NaN        NaN    NaN  
14244         0.00           NaN        NaN    NaN  
In [10]:
# I delete this 2 rows because there's no valuable information on them, I cannot know what games they are.
df.dropna(subset=['name'], inplace=True)
print(f"There are {df['name'].isna().sum()} missing values in 'name'")
There are 0 missing values in 'name'
In [11]:
# Missing data per column

print(f"there are {df['platform'].isna().sum()} missing values in 'platform'")
print(f"there are {df['year_of_release'].isna().sum()} missing values in 'year_of_release'")
print(f"there are {df['genre'].isna().sum()} missing values in 'genre'")
print(f"there are {df['na_sales'].isna().sum()} missing values in 'na_sales'")
print(f"there are {df['eu_sales'].isna().sum()} missing values in 'eu_sales'")
print(f"there are {df['jp_sales'].isna().sum()} missing values in 'jp_sales'")
print(f"there are {df['other_sales'].isna().sum()} missing values in 'other_sales'")
print(f"there are {df['critic_score'].isna().sum()} missing values in 'critic_score'")
print(f"there are {df['user_score'].isna().sum()} missing values in 'user_score'")
print(f"there are {df['rating'].isna().sum()} missing values in 'rating'")
there are 0 missing values in 'platform'
there are 269 missing values in 'year_of_release'
there are 0 missing values in 'genre'
there are 0 missing values in 'na_sales'
there are 0 missing values in 'eu_sales'
there are 0 missing values in 'jp_sales'
there are 0 missing values in 'other_sales'
there are 8576 missing values in 'critic_score'
there are 6699 missing values in 'user_score'
there are 6764 missing values in 'rating'
Comentario del revisor. (Iteración 1)

Perfecto! Haces un correcto uso del método isna para hacer la revisión de nulos sobre el dataframe.

Fill missing values

In [12]:
# year_of_release
df['year_of_release'] = df['year_of_release'].fillna(0)

# critic_score and user_score
df['critic_score'] = df['critic_score'].fillna(0)
df['user_score'] = df['user_score'].fillna(0)

# rating
df['rating'] = df['rating'].fillna('Unknown')
Comentario del revisor. (Iteración 1)

Me gusta la estrategia que sigues para hacer la imputación de valores nulos sobre las columnas year_of_release y rating, sin embargo, sobre clos score, tanto de los criticos como de los usuarios, deberíamos tener cuidado, debido a que 0, puede ser considerado una escala más.

In [13]:
print(f"there are {df['year_of_release'].isna().sum()} missing values in 'year_of_release'")
print(f"there are {df['critic_score'].isna().sum()} missing values in 'critic_score'")
print(f"there are {df['user_score'].isna().sum()} missing values in 'user_score'")
print(f"there are {df['rating'].isna().sum()} missing values in 'rating'")
print()
print("***It is very important to remember, when calculating scores,\nto exclude 0 values, as including them might skew the results.***")
there are 0 missing values in 'year_of_release'
there are 0 missing values in 'critic_score'
there are 0 missing values in 'user_score'
there are 0 missing values in 'rating'

***It is very important to remember, when calculating scores,
to exclude 0 values, as including them might skew the results.***
In [14]:
# Check for missing values

df.isna().sum()
Out[14]:
name               0
platform           0
year_of_release    0
genre              0
na_sales           0
eu_sales           0
jp_sales           0
other_sales        0
critic_score       0
user_score         0
rating             0
dtype: int64
Change types¶
In [15]:
# Change Dtypes.
# Column "year_of_release" is not needed as float, since we are talking about years, I change it to int.
df['year_of_release'] = df['year_of_release'].astype('int')
# Column "critic_score" has no decimal values, so I change it int.
df['critic_score'] = df['critic_score'].astype('int')
# "tbd" means to be determined, for our reserch is missing data, so I change it to "0".
df['user_score'] = df['user_score'].replace({'tbd': 0}).astype('float')
In [16]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16713 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16713 non-null  object 
 1   platform         16713 non-null  object 
 2   year_of_release  16713 non-null  int64  
 3   genre            16713 non-null  object 
 4   na_sales         16713 non-null  float64
 5   eu_sales         16713 non-null  float64
 6   jp_sales         16713 non-null  float64
 7   other_sales      16713 non-null  float64
 8   critic_score     16713 non-null  int64  
 9   user_score       16713 non-null  float64
 10  rating           16713 non-null  object 
dtypes: float64(5), int64(2), object(4)
memory usage: 1.5+ MB
Comentario del revisor. (Iteración 1)

Perfecto, pero mira de nuevo haces el cambio de tbd (que representa un valor nulo a cero, lo cual puede ser delicado, debido a que puede ser considerada una calificación.

Total sales¶
In [17]:
# I create "total_sales" as a new column with the sum of the 4 sales columns.
df["total_sales"] = df["na_sales"] + df["eu_sales"] + df["jp_sales"] + df["other_sales"]

# I rearenge columns order.
position = df.columns.get_loc("other_sales") + 1
df.insert(position, "total_sales", df.pop("total_sales"))
Comentario del revisor. (Iteración 1)

Perfecto! La manera en que calculas las ventas totales es correcta!

In [18]:
# The Dataframe is clean and ready to be used.
df.sample(20)
Out[18]:
name platform year_of_release genre na_sales eu_sales jp_sales other_sales total_sales critic_score user_score rating
1733 wwe smackdown vs. raw 2010 PS3 2009 fighting 0.48 0.48 0.01 0.20 1.17 81 8.1 T
4605 rush 2: extreme racing usa N64 1998 racing 0.35 0.06 0.00 0.00 0.41 0 0.0 Unknown
5354 classic nes series: ice climber GBA 2004 platform 0.10 0.04 0.20 0.01 0.35 66 0.0 E
4091 motorstorm: apocalypse PS3 2011 racing 0.21 0.19 0.00 0.08 0.48 77 7.7 T
10803 jikkyou powerful pro yakyuu 11 GC 2004 sports 0.00 0.00 0.09 0.00 0.09 0 0.0 Unknown
10086 super swing golf season 2 WII 2007 sports 0.08 0.00 0.03 0.01 0.12 70 0.0 E10+
10765 csi: hard evidence X360 2007 adventure 0.07 0.02 0.00 0.01 0.10 0 0.0 Unknown
9872 just dance: disney party WII 2012 misc 0.01 0.10 0.00 0.01 0.12 0 0.0 E
15165 hatsukare * renai debut sengen! PSP 2013 adventure 0.00 0.00 0.02 0.00 0.02 0 0.0 Unknown
4829 monster trucks mayhem WII 2009 racing 0.38 0.00 0.00 0.02 0.40 0 0.0 E
372 donkey kong country 3: dixie kong's double tro... SNES 1996 platform 1.17 0.50 1.75 0.08 3.50 0 0.0 Unknown
3937 the adventures of jimmy neutron boy genius: at... GBA 2004 action 0.36 0.13 0.00 0.01 0.50 56 0.0 E
9836 dirt 2 DS 2009 racing 0.06 0.04 0.00 0.01 0.11 73 0.0 E
16127 kono aozora ni yakusoku o: tenohira no rakuen PSP 2009 adventure 0.00 0.00 0.01 0.00 0.01 0 0.0 Unknown
15098 peter pan in disney's return to never land PS 2002 platform 0.01 0.01 0.00 0.00 0.02 0 0.0 Unknown
3120 007: quantum of solace WII 2008 action 0.29 0.28 0.01 0.07 0.65 54 7.5 T
4462 singstar queen PS3 2009 misc 0.11 0.25 0.00 0.07 0.43 0 0.0 Unknown
1520 mega man battle network 3 blue / white version GBA 2002 role-playing 0.39 0.02 0.85 0.03 1.29 0 0.0 Unknown
8343 transformers: dark of the moon WII 2011 action 0.09 0.06 0.00 0.02 0.17 0 0.0 Unknown
14979 blazblue: chrono phantasma extend PSV 2015 action 0.00 0.02 0.00 0.01 0.03 0 6.8 T

Step 3: Data analysis¶

Games released by year¶
In [19]:
# How many games were released per year.
df.groupby('year_of_release')['name'].count().sort_values(ascending=False).reset_index().head()
Out[19]:
year_of_release name
0 2008 1427
1 2009 1426
2 2010 1255
3 2007 1197
4 2011 1136
In [20]:
import plotly.express as px

# years "0" are missing values, thats why I need to filter the df before creating an histogram.
df_filtered = df[df["year_of_release"] != 0]

hist_releases = px.histogram(df_filtered, x="year_of_release",
                   title="Most of the releases were between 2007 and 2011",
                   labels={"year_of_release": "Year"}
                   #text_auto=True
                            )

# Mostrar la gráfica
hist_releases.show()
Comentario del revisor. (Iteración 1)

Muy interesante ¿No crees?¿Por qué será que la cantidad de videojuegos empieza a decrecer a partir del 2012?

Sales per platform¶
In [21]:
# Platforms with more total sales.
best_sellers = df.groupby('platform')['total_sales'].sum().sort_values(ascending=False).reset_index().head(6)
best_sellers
Out[21]:
platform total_sales
0 PS2 1255.77
1 X360 971.42
2 PS3 939.65
3 WII 907.51
4 DS 806.12
5 PS 730.86
Comentario del revisor. (Iteración 1)

Muy bien, el cálculo está bien realizado.

Lifetime per platform¶
In [22]:
# Best platforms selling titles.
# df_filtered excludes years with value "0"

""" Play Station One """

ps_one = df_filtered[df_filtered['platform'] == 'PS']

hist_sales = px.histogram(
    ps_one, # Data
    x="year_of_release",  # Eje X: los años
    y="total_sales",  # Eje Y: ingresos
    title="Play Station One lifetime",
    color_discrete_sequence=["grey"],
    labels={"year_of_release": "Year",
            "total_sales": "Sales"}
    #text_auto=True
)

# Mostrar la gráfica
hist_sales.show()
Comentario del revisor. (Iteración 1)

Vale, el gráfico se ve bien, pero estoy notando que a lo largo de los gráficos que realizas, no estás colocando tus observaciones y conclusiones, es importante que realices esto, debido a que es un habito que en el futuro te ayudará, recuerda que traducir lo que nos datos nos quieren decir, es una de las principales labores de un analista.

In [23]:
""" Nintendo DS """

nintendo_ds = df_filtered[df_filtered['platform'] == 'DS']

hist_sales = px.histogram(
    nintendo_ds, # Data
    x="year_of_release",  # Eje X: los años
    y="total_sales",  # Eje Y: ingresos
    title="Nintendo DS lifetime",
    color_discrete_sequence=["blue"],
    labels={"year_of_release": "Year",
            "total_sales": "Sales"},
    range_x=[2003, 2015]
    #text_auto=True
)

# Mostrar la gráfica
hist_sales.show()
Comentario del revisor. (Iteración 1)

POr ejemplo aquí ¿POr que decides hacer el filtro entre 2003 y 2015 y no considerar todo el rango de fechas?

In [24]:
""" Nintendo Wii """

nintendo_wii = df_filtered[df_filtered['platform'] == 'WII']

hist_sales = px.histogram(
    nintendo_wii, # Data
    x="year_of_release",  # Eje X: los años
    y="total_sales",  # Eje Y: ingresos
    title="Nintendo Wii lifetime",
    color_discrete_sequence=["white"],
    labels={"year_of_release": "Year",
            "total_sales": "Sales"},
    range_x=[2005, 2017]
    #text_auto=True
)

# Mostrar la gráfica
hist_sales.show()
In [25]:
""" Play Station 3 """

ps_3 = df_filtered[df_filtered['platform'] == 'PS3']

hist_sales = px.histogram(
    ps_3, # Data
    x="year_of_release",  # Eje X: los años
    y="total_sales",  # Eje Y: ingresos
    title="Play Station 3 lifetime",
    color_discrete_sequence=["black"],
    labels={"year_of_release": "Year",
            "total_sales": "Sales"}
    #text_auto=True
)

# Mostrar la gráfica
hist_sales.show()
In [26]:
""" Xbox 360 """

xbox_360 = df_filtered[df_filtered['platform'] == 'X360']

hist_sales = px.histogram(
    xbox_360, # Data
    x="year_of_release",  # Eje X: los años
    y="total_sales",  # Eje Y: ingresos
    title="Xbox 360 lifetime",
    color_discrete_sequence=["rgb(242,242,242)"],
    labels={"year_of_release": "Year",
            "total_sales": "Sales"}
    #text_auto=True
)

# Mostrar la gráfica
hist_sales.show()
In [27]:
""" Play Station 2 """

ps_2 = df_filtered[df_filtered['platform'] == 'PS2']

hist_sales = px.histogram(
    ps_2, # Data
    x="year_of_release",  # Eje X: los años
    y="total_sales",  # Eje Y: ingresos
    title="Play Station 2 lifetime",
    color_discrete_sequence=["rgb(102,102,102)"],
    labels={"year_of_release": "Year",
            "total_sales": "Sales"}
    #text_auto=True
)

# Mostrar la gráfica
hist_sales.show()
Comentario del revisor. (Iteración 1)

En todos los casos, los gráficos están correctos, sin embargo es importante que los interpretes, esto hará tu notebook mucho más completo. Otra sugerencia que tengo es que debido a que estás realizando una comparación de diferentes plataformas, valdría mucho la pena que colocarás estos en un solo plot, para poder observar mejor las diferencias.

Comparison¶
In [28]:
# I create a dataframe with only the Best selling platforms.

best_platforms_names = best_sellers['platform']
df_best_platforms = df_filtered[df_filtered['platform'].isin(best_platforms_names)]
df_best_sellers = df_best_platforms[['platform', 'name', 'year_of_release', 'total_sales']]
df_best_sellers
Out[28]:
platform name year_of_release total_sales
0 WII wii sports 2006 82.54
2 WII mario kart wii 2008 35.52
3 WII wii sports resort 2009 32.77
6 DS new super mario bros. 2006 29.80
7 WII wii play 2006 28.91
... ... ... ... ...
16700 DS mezase!! tsuri master ds 2009 0.01
16704 DS plushees 2008 0.01
16709 PS2 score international baja 1000: the official game 2008 0.00
16710 PS3 samurai warriors: sanada maru 2016 0.01
16711 X360 lma manager 2007 2006 0.01

9262 rows × 4 columns

In [29]:
# I group the total sales by platform and year.

grouped_df = df_best_sellers.groupby(["platform", "year_of_release"], as_index=False)['total_sales'].sum()
grouped_df = grouped_df.drop(index=0).reset_index(drop=True)
grouped_df
Out[29]:
platform year_of_release total_sales
0 DS 2004 17.27
1 DS 2005 130.14
2 DS 2006 119.81
3 DS 2007 146.94
4 DS 2008 145.31
... ... ... ...
61 X360 2012 99.74
62 X360 2013 88.58
63 X360 2014 34.74
64 X360 2015 11.96
65 X360 2016 1.52

66 rows × 3 columns

In [30]:
# Crear la gráfica de líneas
fig = px.line(
    grouped_df,
    x="year_of_release",  # Eje X: Año de lanzamiento
    y="total_sales",  # Eje Y: Ventas totales
    color="platform",  # Líneas separadas por consola
    markers=True,
    title="The most successful platforms tend to remain in the market for around 10 years",
    labels={"year_of_release": "Years", "total_sales": "Total Sales (in millions)", "platform": "Platform"}
)

# Mostrar la gráfica
fig.show()
Comentario del revisor. (Iteración 1)

Perfecto! El gráfico se ve realmente bien y se lee bastante fácil.

Used to be popular¶

PS2, X360, PS3, Wii, DS, and PS were once the most successful platforms and, historically, the most lucrative. However, none of them are active these days (as of 2016), They had a life span of approximately 10 years, they are no longer generating sales. On the other hand, there are some new platforms that are just emerging and starting to grow. I'll explore them next.

Looking for the next bestsellers¶
In [31]:
# Best selling platforms since 2014
df_filtered[df_filtered['year_of_release'] >= 2014].groupby('platform')['total_sales'].sum().sort_values(ascending=False).head(5)
Out[31]:
platform
PS4     288.15
XONE    140.36
3DS      86.68
PS3      68.18
X360     48.22
Name: total_sales, dtype: float64
In [32]:
df = df_filtered
In [33]:
# New best selling platforms
new_sellers = df[df['platform'].isin(['PS4', 'XONE', '3DS'])]
new_sellers_by_years = new_sellers.groupby(['platform', 'year_of_release'])['total_sales'].sum().reset_index()
In [34]:
# line graph

fig = px.line(
    new_sellers_by_years,
    x="year_of_release",  # Eje X: Año de lanzamiento
    y="total_sales",  # Eje Y: Ventas totales
    color="platform",  # Líneas separadas por consola
    markers=True,
    title="PS4 is the most successful platform currently",
    labels={"year_of_release": "Years", "total_sales": "Total Sales (in millions)", "platform": "Platform"}
)

# Mostrar la gráfica
fig.show()
Comentario del revisor. (Iteración 1)

De nuevo, creo que hace falta colocar tus observaciones. Por otro lado, me gusta mucho que estés haciendo uso de plotly, debido a que es una herramienta que ya es un poco más avanzada, así que felicidades por esto.

Box graphs¶
In [35]:
fig = px.box(
    df,
    x="platform",
    y="total_sales",
    color="platform",
    title="Total sales distribution",
    labels={"platform": "Platform", "total_sales": "Total Sales (in millions)"}
)

fig.update_layout(
    height=800,  # Aumentar altura
    width=1200   # Aumentar ancho
)

fig.show()
Comentario del revisor. (Iteración 1)

Aquí trataría de cambiar un poco la escala o bien hacer dos gráficos, uno con las plataformas con más outliers para que se pueda distinguir bien las cajas de cada juego.

In [36]:
#outliers
df.query("platform == 'WII' and total_sales > 1.9")
Out[36]:
name platform year_of_release genre na_sales eu_sales jp_sales other_sales total_sales critic_score user_score rating
0 wii sports WII 2006 sports 41.36 28.96 3.77 8.45 82.54 76 8.0 E
2 mario kart wii WII 2008 racing 15.68 12.76 3.79 3.29 35.52 82 8.3 E
3 wii sports resort WII 2009 sports 15.61 10.93 3.28 2.95 32.77 80 8.0 E
7 wii play WII 2006 misc 13.96 9.18 2.93 2.84 28.91 58 6.6 E
8 new super mario bros. wii WII 2009 platform 14.44 6.94 4.70 2.24 28.32 87 8.4 E
... ... ... ... ... ... ... ... ... ... ... ... ...
818 kirby's epic yarn WII 2010 platform 1.42 0.09 0.45 0.10 2.06 0 0.0 Unknown
819 wipeout: the game WII 2009 misc 1.94 0.00 0.00 0.12 2.06 0 0.0 Unknown
843 rayman raving rabbids: tv party WII 2008 misc 0.72 1.08 0.00 0.23 2.03 73 7.7 E10+
850 just dance 2015 WII 2014 misc 0.99 0.85 0.00 0.17 2.01 0 8.4 E10+
887 call of duty: world at war WII 2008 shooter 1.17 0.58 0.00 0.18 1.93 83 7.6 M

64 rows × 12 columns

In [37]:
wii = df.query("platform == 'WII'")

fig = px.box(
    wii,
    x="total_sales",
    y="platform",
    title="Wii Total sales distribution",
    labels={"platform": " ", "total_sales": "Total Sales (in millions)"}
)

fig.update_layout(
    height=800,  # Aumentar altura
    width=1100   # Aumentar ancho
)

fig.show()
In [38]:
nes = df.query("platform == 'NES' and total_sales > 4.38")

fig = px.box(
    nes,
    x="total_sales",
    y="platform",
    title="NES Total sales distribution",
    labels={"platform": " ", "total_sales": "Total Sales (in millions)"}
)

fig.update_layout(
    height=800,  # Aumentar altura
    width=1100   # Aumentar ancho
)

fig.show()
In [39]:
gb = df.query("platform == 'GB'")

fig = px.box(
    gb,
    x="total_sales",
    y="platform",
    title="GB Total sales distribution",
    labels={"platform": " ", "total_sales": "Total Sales (in millions)"}
)

fig.update_layout(
    height=800,  # Aumentar altura
    width=1100   # Aumentar ancho
)

fig.show()
In [40]:
ds = df.query("platform == 'DS'")

fig = px.box(
    ds,
    x="total_sales",
    y="platform",
    title="DS Total sales distribution",
    labels={"platform": " ", "total_sales": "Total Sales (in millions)"}
)

fig.update_layout(
    height=800,  # Aumentar altura
    width=1100   # Aumentar ancho
)

fig.show()

Most platforms have an mean close to zero.

Comentario del revisor. (Iteración 1)

Ok, veo que si separaste algunos boxplots, eso está perfecto, aquí se logra apreciar mucho mejor las características de las distribuciones que obtienes.

Corelation¶
In [41]:
# Box graph to understand outlier values on sales

sales = df.query("total_sales != 0")
fig = px.box(
    sales,
    x="total_sales")

fig.update_layout(
    xaxis_range=[0, 2])

fig.show()
Comentario del revisor. (Iteración 1)

Vale, entiendo que este gráfico refleja la distribución de las ventas totales, pero no olvides colocar titulos y nombres a los ejes para que este sea mucho más facil de leer.

In [42]:
# Critic Score VS Total Sales
critic_score = df.query("critic_score != 0 and total_sales < 1.1")

correlation_critic_score = critic_score["critic_score"].corr(critic_score["total_sales"])
print(f"Correlation: {correlation_critic_score}")

# Fig
fig = px.scatter(
    critic_score,
    x="total_sales",
    y="critic_score",
    title="There is a slight positive correlation between Critic Scores and Sales.",
    labels={"critic_score": "Critic Score", "total_sales": "Total Sales (in millions)"},
    trendline="ols"
)

for trace in fig.data:
    if trace.mode == "lines":  # Identificar la línea de tendencia
        trace.line.color = "red"  # Cambiar el color

fig.update_layout(
    yaxis_range=[0, 100], xaxis_range=[0, 1.1])

# Mostrar la gráfica
fig.show()
Correlation: 0.267071717767249
In [43]:
# User Score VS Total Sales

user_score = df.query("user_score != 0 and total_sales < 1.1")

correlation_user_score = user_score["user_score"].corr(user_score["total_sales"])
print(f"Correlation: {correlation_user_score}")

fig = px.scatter(
    user_score,
    x="total_sales",            # Eje X
    y="user_score",             # Eje Y
    title="There is a slight positive correlation between User Score and Sales",
    labels={"critic_score": "Critic Score", "total_sales": "Total Sales (in millions)"},
    trendline="ols"              # Opcional: agrega una línea de tendencia
)
for trace in fig.data:
    if trace.mode == "lines":  # Identificar la línea de tendencia
        trace.line.color = "red"  # Cambiar el color

fig.update_layout(
    yaxis_range=[0, 10],
    xaxis_range=[0, 1.1])

# Mostrar la gráfica
fig.show()
Correlation: 0.10769542636565735

There is a correlation between rating and sales, but it is very low in both cases.

Comentario del revisor. (Iteración 1)

Si, parece ser que las ventas no están asociadas a las reseñas de ningún tipo, esto es interesante, pues parece que los usuarios las compran por la historia y la fama de las marcas.

Same games on diferent platforms¶
In [44]:
df.groupby('name')['platform'].count().sort_values(ascending=False)
Out[44]:
name
need for speed: most wanted     12
ratatouille                      9
lego marvel super heroes         9
fifa 14                          9
madden nfl 07                    8
                                ..
infamous: first light            1
infamous: second son             1
infected                         1
infernal: hell's vengeance       1
¡shin chan flipa en colores!     1
Name: platform, Length: 11426, dtype: int64
In [45]:
# Most lucrative games on diferent platforms

result = df.groupby("name").agg(
    platform=("platform", lambda x: " - ".join(sorted(x.unique()))),  # Listar plataformas únicas
    total_sales=("total_sales", "sum"),  # Sumar las ventas totales
).reset_index()

result.sort_values(by="total_sales", ascending=False).head(20)
Out[45]:
name platform total_sales
10927 wii sports WII 82.54
3691 grand theft auto v PC - PS3 - PS4 - X360 - XONE 56.58
9300 super mario bros. GB - NES 45.31
9660 tetris GB - NES 35.84
5501 mario kart wii WII 35.52
10929 wii sports resort WII 32.77
7316 pokemon red/pokemon blue GB 31.38
1220 call of duty: modern warfare 3 PC - PS3 - WII - X360 30.60
6614 new super mario bros. DS 29.80
1215 call of duty: black ops ii PC - PS3 - WIIU - X360 29.40
1213 call of duty: black ops DS - PS3 - WII - X360 29.20
10925 wii play WII 28.91
6617 new super mario bros. wii WII 28.32
2580 duck hunt NES 28.31
1217 call of duty: ghosts PC - PS3 - PS4 - WIIU - X360 - XONE 27.39
9314 super mario world GBA - SNES 26.09
1214 call of duty: black ops 3 PC - PS3 - PS4 - X360 - XONE 25.67
1219 call of duty: modern warfare 2 PC - PS3 - X360 24.97
6770 nintendogs DS 24.67
5862 minecraft PS3 - PS4 - PSV - WIIU - X360 - XONE 24.16
In [46]:
# Grand Theft Auto V
grand_theft_auto_v = df.loc[df['name'] == "grand theft auto v"]
grand_theft_auto_v['total_sales'].round(2)
grand_theft_auto_v
Out[46]:
name platform year_of_release genre na_sales eu_sales jp_sales other_sales total_sales critic_score user_score rating
16 grand theft auto v PS3 2013 action 7.02 9.09 0.98 3.96 21.05 97 8.2 M
23 grand theft auto v X360 2013 action 9.66 5.14 0.06 1.41 16.27 97 8.1 M
42 grand theft auto v PS4 2014 action 3.96 6.31 0.38 1.97 12.62 97 8.3 M
165 grand theft auto v XONE 2014 action 2.81 2.19 0.00 0.47 5.47 97 7.9 M
1730 grand theft auto v PC 2015 action 0.39 0.69 0.00 0.09 1.17 96 7.9 M
In [47]:
fig = px.bar(
    grand_theft_auto_v,
    x="platform",
    y="total_sales",
    title="grand_theft_auto_v",
    labels={"platform": "Platform", "total_sales": "Total Sales (in millions)"},
    color="platform",
    text="total_sales"  # Mostrar valores en las barras
)

fig.show()
In [48]:
# Super Mario
# Not sure to count it as the same game, since the second one was released 14 years after the first one.

super_mario = df[df['name'] == "super mario bros."]
super_mario.loc[:, 'total_sales'].round(2)
super_mario
Out[48]:
name platform year_of_release genre na_sales eu_sales jp_sales other_sales total_sales critic_score user_score rating
1 super mario bros. NES 1985 platform 29.08 3.58 6.81 0.77 40.24 0 0.0 Unknown
201 super mario bros. GB 1999 platform 3.40 1.30 0.15 0.22 5.07 0 0.0 Unknown
In [49]:
fig = px.bar(
    super_mario,
    x="platform",
    y="total_sales",
    title="super_mario",
    color="platform",
    labels={"platform": "Platform", "total_sales": "Total Sales (in millions)"}
)

fig.show()
In [50]:
# Tetris
tetris = df[df['name'] == "tetris"]
tetris
Out[50]:
name platform year_of_release genre na_sales eu_sales jp_sales other_sales total_sales critic_score user_score rating
5 tetris GB 1989 puzzle 23.20 2.26 4.22 0.58 30.26 0 0.0 Unknown
157 tetris NES 1988 puzzle 2.97 0.69 1.81 0.11 5.58 0 0.0 Unknown
In [51]:
fig = px.bar(
    tetris,
    x="platform",
    y="total_sales",
    title="Tetris",
    color="platform",
    labels={"platform": "Platform", "total_sales": "Total Sales (in millions)"}
)

fig.show()
In [52]:
# call of duty: modern warfare 3
cod = df[df['name'] == "call of duty: modern warfare 3"]
cod
Out[52]:
name platform year_of_release genre na_sales eu_sales jp_sales other_sales total_sales critic_score user_score rating
29 call of duty: modern warfare 3 X360 2011 shooter 9.04 4.24 0.13 1.32 14.73 88 3.4 M
37 call of duty: modern warfare 3 PS3 2011 shooter 5.54 5.73 0.49 1.57 13.33 88 3.2 M
1040 call of duty: modern warfare 3 PC 2011 shooter 0.41 0.98 0.00 0.32 1.71 78 2.5 M
2500 call of duty: modern warfare 3 WII 2011 shooter 0.55 0.20 0.00 0.08 0.83 70 1.8 M
In [53]:
fig = px.bar(
    cod,
    x="platform",
    y="total_sales",
    title="call of duty: modern warfare 3",
    color="platform",
    text="total_sales",
    labels={"platform": "Platform", "total_sales": "Total Sales (in millions)"}
)

fig.show()

It seems that titles sell better on consoles than on PC. Additionally, games on Game Boy might perform well, but they don't even come close to the numbers achieved on the NES.

Comentario del revisor. (Iteración 1)

Perfecto, me gusta el análisis que estás realizando y la conclusión que obtienes.

Genre vs Sales¶
In [54]:
# I group and arrange the values as needed. (I formatted the numbers to show only two decimal places.)

sales_sorted = sales.groupby("genre", as_index=False)["total_sales"].sum().sort_values(by="total_sales", ascending=False)
sales_sorted["total_sales"] = sales_sorted["total_sales"].round(2)
sales_sorted
Out[54]:
genre total_sales
0 action 1716.52
10 sports 1309.67
8 shooter 1041.36
7 role-playing 931.08
4 platform 825.55
3 misc 790.29
6 racing 723.49
2 fighting 442.66
9 simulation 387.96
5 puzzle 239.89
1 adventure 233.33
11 strategy 172.57
In [55]:
# Distribution of genre by sales

fig = px.bar(
    sales_sorted,
    x="genre",
    y="total_sales",
    title="Genre distribution by sales",
    labels={"genre": "Genre", "total_sales": "Sales (in millions)"},
    color="genre",
    text="total_sales"  # Mostrar valores en las barras
)

fig.show()
Comentario del revisor. (Iteración 1)

Excelente gráfico

About genres in videogames

Apparently, it seems more lucrative to sell action games than strategy games. However, I should analyze the data more closely to clarify this hypothesis. Perhaps the "action" genre generates the highest amount of revenue because it is the most popular genre and includes a large number of titles, not necessarily because all of those games are major hits. On the other hand, "strategy" games might have a few great-selling titles but lack the sheer number of releases.

In [56]:
# DF games by name
games_grouped = df.groupby("name", as_index=False).agg(
    platform=("platform", lambda x: ", ".join(sorted(x.unique()))),
    total_sales=("total_sales", "sum"), genre=("genre", lambda x: ", ".join(sorted(x.unique())))
    ).sort_values(by="total_sales", ascending=False)

games_grouped
Out[56]:
name platform total_sales genre
10927 wii sports WII 82.54 sports
3691 grand theft auto v PC, PS3, PS4, X360, XONE 56.58 action
9300 super mario bros. GB, NES 45.31 platform
9660 tetris GB, NES 35.84 puzzle
5501 mario kart wii WII 35.52 racing
... ... ... ... ...
2860 exhibition volume 6 XB 0.01 misc
2843 evil genius PC 0.01 strategy
6981 ozzy & drix GBA 0.01 platform
4784 katekyoo hitman reborn nerae!? ring x bongole ... PS2 0.01 adventure
3399 g1 jockey 4 2008 PS3 0.00 sports

11426 rows × 4 columns

In [57]:
# Count of games by Genre
games_grouped['genre'].value_counts()
Out[57]:
action               1929
sports               1366
misc                 1317
role-playing         1213
adventure            1052
shooter               812
racing                761
simulation            717
fighting              606
platform              579
strategy              578
puzzle                491
action, adventure       2
shooter, strategy       1
misc, strategy          1
action, shooter         1
Name: genre, dtype: int64

"Action" is the most popular genre among all the games

Action includes 1,929 titles, followed by Sports with 1,366. On the other hand, Strategy has only 578 titles.

In [58]:
#Action games
action_games = games_grouped[games_grouped['genre'] == 'action']
action_games.sort_values(by='total_sales', ascending=False).head(20)
Out[58]:
name platform total_sales genre
3691 grand theft auto v PC, PS3, PS4, X360, XONE 56.58 action
3695 grand theft auto: san andreas PC, PS2, X360, XB 23.85 action
3690 grand theft auto iv PC, PS3, X360 22.37 action
3696 grand theft auto: vice city PC, PS2 16.19 action
3090 fifa soccer 13 3DS, PC, PS3, PSP, PSV, WII, WIIU, X360 15.97 action
5157 lego star wars: the complete saga DS, PS3, WII, X360 15.25 action
3689 grand theft auto iii PC, PS2 13.11 action
487 assassin's creed iv: black flag PC, PS3, PS4, WIIU, X360, XONE 13.06 action
485 assassin's creed iii PC, PS3, WIIU, X360 13.01 action
7795 red dead redemption PS3, X360 12.80 action
9907 the legend of zelda: ocarina of time 3DS, N64 11.93 action
7304 pokemon heartgold/pokemon soulsilver DS 11.77 action
483 assassin's creed ii PC, PS3, X360 11.34 action
3693 grand theft auto: liberty city stories PS2, PSP 11.21 action
671 batman: arkham city PC, PS3, WIIU, X360 11.10 action
9876 the last of us PS3, PS4 10.58 action
9085 star wars: the force unleashed DS, PS2, PS3, PSP, WII, X360 10.02 action
5140 lego indiana jones: the original adventures DS, PC, PS2, PS3, PSP, X360 9.57 action
5146 lego marvel super heroes 3DS, DS, PC, PS3, PS4, PSV, WIIU, X360, XONE 9.41 action
10853 watch dogs PC, PS3, PS4, WIIU, X360, XONE 9.18 action
In [59]:
#I create a box plot to understand the mean and outliers.
# Following that, I generate a histogram to visualize the data distribution.

fig = px.box(
    action_games,
    x="total_sales",
    title="Action games layouts",
    color_discrete_sequence=["#1F77B4"],
    labels={"total_sales": "Total Sales (in millions)"}
)

fig.update_layout(
    xaxis_range=[0, 1])

fig.show()
In [60]:
fig = px.histogram(
    action_games,
    x="total_sales",
    title="Action games distribution",
    color_discrete_sequence=["#1F77B4"],
    labels={"total_sales": "Total Sales (in millions)"}
)
fig.update_layout(
    xaxis_range=[0, 7])

fig.show()
In [61]:
# Strategy games
strategy = games_grouped[games_grouped['genre'] == 'strategy']
strategy.sort_values(by='total_sales', ascending=False).head(20)
Out[61]:
name platform total_sales genre
7321 pokemon stadium N64 5.45 strategy
10850 warzone 2100 PS 5.01 strategy
9094 starcraft ii: wings of liberty PC 4.83 strategy
10804 warcraft ii: tides of darkness PC 4.20 strategy
7339 pokémon trading card game GB 3.70 strategy
1557 command & conquer: red alert PC, PS 3.27 strategy
7338 pokémon stadium 2 N64 2.74 strategy
8617 sim theme park PC, PS 2.71 strategy
3882 halo wars X360 2.62 strategy
10102 theme hospital PC, PS 2.49 strategy
10807 warcraft: orcs & humans PC 2.08 strategy
11324 yu-gi-oh! the eternal duelist soul GBA 2.07 strategy
8585 sid meier's civilization revolution DS, PS3, X360 2.05 strategy
11201 xcom: enemy unknown PC, PS3, X360 1.90 strategy
10253 tom clancy's endwar DS, PC, PS3, PSP, X360 1.82 strategy
331 angry birds star wars 3DS, PS3, PS4, PSV, WII, WIIU, X360, XONE 1.73 strategy
9092 starcraft ii: heart of the swarm PC 1.67 strategy
8587 sid meier's civilization v PC 1.67 strategy
7215 pikmin GC 1.63 strategy
11304 yu-gi-oh! duel monsters GB 1.62 strategy
In [62]:
#Strategy outliers
fig = px.box(
    strategy,
    x="total_sales",
    title="Strategy games layouts",
    color_discrete_sequence=["#fb8500"],
    labels={"total_sales": "Total Sales (in millions)"}
)
fig.update_layout(
    xaxis_range=[0, 1])

fig.show()
In [63]:
# Strategy distribution

fig = px.histogram(
    strategy,
    x="total_sales",
    title="Strategy games distribution",
    color_discrete_sequence=["#fb8500"],
    labels={"total_sales": "Total Sales (in millions)"}
)

fig.update_layout(
    xaxis_range=[0, 6])

fig.show()

Action games generally generate better revenue than strategy games. However, that doesn’t exclude strategy games from having great titles that can achieve good sales. Nevertheless, action games are generally better received by the audience than strategy games.

Comentario del revisor. (Iteración 1)

Perfecto, vas realizando un gran analisis, solamente ten cuidado con no olvidar tus observaciones en cada gráfico.

Step 4: User profile¶

In [64]:
# I group the df per platform.
# recent years.
df_recent_years = df[df['year_of_release'] >= 2013]
platforms_grouped = df_recent_years.groupby('platform')[['na_sales', 'eu_sales', 'jp_sales', 'total_sales']].sum().reset_index()
platforms_grouped.sort_values(by='total_sales', ascending=False).head()
Out[64]:
platform na_sales eu_sales jp_sales total_sales
4 PS4 108.74 141.09 15.96 314.14
3 PS3 63.50 67.81 23.35 181.43
10 XONE 93.12 51.59 0.34 159.32
0 3DS 38.20 30.96 67.81 143.25
9 X360 81.66 42.52 0.51 136.80
5 Principal platforms per region (NA, UE, JP)¶
In [65]:
""" North America """
america_platforms = platforms_grouped.sort_values(by='na_sales', ascending=False)[['platform', 'na_sales']]
america_platforms['na_sales'] = america_platforms['na_sales'].round(2)

fig = px.bar(
    america_platforms.head(5),
    x="platform",
    y="na_sales",
    title="Principal platforms in North America",
    color="platform",
    labels={"platform": "Platform", "na_sales": "Sales (in millions)"},
    text="na_sales"  # Mostrar valores en las barras
)

fig.show()

best_na_platforms = []
for console in america_platforms['platform'].head():
  best_na_platforms.append(console)

print(f"The best selling platforms in North America are:\n{best_na_platforms}")
The best selling platforms in North America are:
['PS4', 'XONE', 'X360', 'PS3', '3DS']
Comentario del revisor. (Iteración 1)

En este caso, creo que ayudaría mucho el redondear algunas de las cifras, para que no estén causando tanto ruido los ceros que aparecen en cada barra.

In [66]:
""" Europe """
europe_platforms = platforms_grouped.sort_values(by='eu_sales', ascending=False)[['platform', 'eu_sales']]
europe_platforms['eu_sales'] = europe_platforms['eu_sales'].round(2)

fig = px.bar(
    europe_platforms.head(5),
    x="platform",
    y="eu_sales",
    title="Principal platforms in Europe",
    color="platform",
    text="eu_sales",
    labels={"platform": "Platform", "eu_sales": "Sales (in millions)"},

)

fig.show()

best_eu_platforms = []
for console in europe_platforms['platform'].head():
  best_eu_platforms.append(console)

print(f"The best selling platforms in Europe are:\n{best_na_platforms}")
The best selling platforms in Europe are:
['PS4', 'XONE', 'X360', 'PS3', '3DS']
In [67]:
""" Japan """
japan_platforms = platforms_grouped.sort_values(by='jp_sales', ascending=False)[['platform', 'jp_sales']]
japan_platforms['jp_sales'] = japan_platforms['jp_sales'].round(2)

fig = px.bar(
    japan_platforms.head(5),
    x="platform",
    y="jp_sales",
    title="Principal platforms in Japan",
    color="platform",
    text="jp_sales",
    labels={"platform": "Platform", "jp_sales": "Sales (in millions)"},

)

fig.show()

best_jp_platforms = []
for console in japan_platforms['platform'].head():
  best_jp_platforms.append(console)

print(f"The best selling platforms in Japan are:\n{best_jp_platforms}")
The best selling platforms in Japan are:
['3DS', 'PS3', 'PSV', 'PS4', 'WIIU']
5 Principal videogame genres per region¶
In [68]:
#DF games grouped by genres
games_by_genre = (
    df_recent_years.groupby('genre', as_index=False)
    [['na_sales', 'eu_sales', 'jp_sales', 'total_sales']].sum().reset_index(drop=True))

"""I’ll avoid the miscellaneous genre since those games are so different from each other
that I don’t think they should be placed in a single category. Because of this,
I’ll treat them as missing data for now."""


games_by_genre.drop(index=3, inplace=True)
games_by_genre.reset_index(drop=True, inplace=True)
games_by_genre
Out[68]:
genre na_sales eu_sales jp_sales total_sales
0 action 126.05 118.13 40.49 321.87
1 adventure 7.14 8.25 5.82 23.64
2 fighting 15.55 8.55 7.65 35.31
3 platform 18.14 15.58 4.79 42.63
4 puzzle 0.83 1.00 1.18 3.17
5 racing 12.96 20.19 2.30 39.89
6 role-playing 46.40 36.97 51.04 145.89
7 shooter 109.74 87.86 6.61 232.98
8 simulation 4.86 10.92 4.52 21.76
9 sports 65.27 60.52 5.41 150.65
10 strategy 3.28 4.22 1.77 10.08
In [69]:
""" North America """
america_genres = games_by_genre.sort_values(by='na_sales', ascending=False)[['genre', 'na_sales']]
america_genres['na_sales'] = america_genres['na_sales'].round(2)

popular_na_genres = []
for genre in america_genres['genre'].head():
  popular_na_genres.append(genre)

fig = px.bar(
    america_genres.head(5),
    x="genre",
    y="na_sales",
    labels={"genre": "Genre", "na_sales": "Sales (in millions)"},
    title=f"Principal genres in North America are: {', '.join(popular_na_genres)}.",
    color="genre",
    text="na_sales",
    color_discrete_map={
                "shooter": "red",
                "sports": "green",
                "action": "blue",
                "role-playing": "goldenrod",
                "misc": "cyan"})

fig.show()
In [70]:
""" Europe """
europe_genres = games_by_genre.sort_values(by='eu_sales', ascending=False)[['genre', 'eu_sales']]
europe_genres['eu_sales'] = europe_genres['eu_sales'].round(2)

popular_eu_genres = []
for genre in europe_genres['genre'].head():
  popular_eu_genres.append(genre)

fig = px.bar(
    europe_genres.head(5),
    x="genre",
    y="eu_sales",
    labels={"genre": "Genre", "eu_sales": "Sales (in millions)"},
    title=f"Principal genres in Europe are: {', '.join(popular_eu_genres)}.",
    color="genre",
    text="eu_sales",
    color_discrete_map={
                "shooter": "red",
                "sports": "green",
                "action": "blue",
                "role-playing": "goldenrod",
                "misc": "magenta",
                "racing": "purple"})
fig.show()
In [71]:
""" Japan """

japan_genres = games_by_genre.sort_values(by='jp_sales', ascending=False)[['genre', 'jp_sales']]
japan_genres['jp_sales'] = japan_genres['jp_sales'].round(2)

popular_jp_genres = []
for genre in japan_genres['genre'].head():
  popular_jp_genres.append(genre)

fig = px.bar(
    japan_genres.head(5),
    x="genre",
    y="jp_sales",
    labels={"genre": "Genre", "jp_sales": "Sales (in millions)"},
    title=f"Principal genres in Japan are: {', '.join(popular_eu_genres)}.",
    color="genre",
    text="jp_sales",
    color_discrete_map={
                "shooter": "red",
                "sports": "green",
                "action": "blue",
                "role-playing": "goldenrod",
                "adventure": "orange",
                "racing": "purple",
                "fighting": "#66AA00"})
fig.show()
Comentario del revisor. (Iteración 1)

En este caso, te recomendaría el colocar los graficos de cada región juntos para poder evaluar mucho mejor las diferencias y similitudes.

Principal Genres

In North America and Europe, action games and shooter games are the most dominant genres. In contrast, in Japan, role-playing games are the most popular.

Additionally, in North America and Europe, the sports genre is quite important. However, this is not the case in Japan, where they seem to be more interested in fighting games.

Rating per region¶
In [72]:
# I group games by rating
games_by_rating = (
    df_recent_years.groupby('rating', as_index=False)
    [['na_sales', 'eu_sales', 'jp_sales', 'total_sales']].sum().reset_index(drop=True))

# I remove the unknown ratings
games_by_rating.drop(index=4, inplace=True)

columns_to_round = ['na_sales', 'eu_sales', 'jp_sales']
games_by_rating[columns_to_round] = games_by_rating[columns_to_round].round(2)
In [73]:
""" North America """

fig = px.bar(
    games_by_rating.sort_values(by='na_sales', ascending=False),
    x="rating",
    y="na_sales",
    labels={"rating": "Rating", "na_sales": "Sales (in millions)"},
    title="'M' rating leads the sales in North America",
    text="na_sales",
    color="rating",
    color_discrete_map={
                "E": "purple",
                "E10+": "yellow",
                "M": "red",
                "T": "blue"})

fig.show()
In [74]:
""" Europe """

fig = px.bar(
    games_by_rating.sort_values(by='eu_sales', ascending=False),
    x="rating",
    y="eu_sales",
    labels={"rating": "Rating", "eu_sales": "Sales (in millions)"},
    title="'E' rating leads the sales in Europe",
    text="eu_sales",
    color="rating",
    color_discrete_map={
                "E": "purple",
                "E10+": "yellow",
                "M": "red",
                "T": "blue"})

fig.show()
In [75]:
""" Japan """

fig = px.bar(
    games_by_rating.sort_values(by='jp_sales', ascending=False),
    x="rating",
    y="jp_sales",
    labels={"rating": "Rating", "jp_sales": "Sales (in millions)"},
    title="'T' Rate leads the sales in Japan",
    text="jp_sales",
    color="rating",
    color_discrete_map={
                "E": "purple",
                "E10+": "yellow",
                "M": "red",
                "T": "blue"})

fig.show()
Profiles conclusion¶

In North America, the audience prefer PlayStation 4 and Xbox One consoles, along with video game genres like action, shooters, and sports. They also tend to favor games with a Mature (M) rating.

In Europe, the audience prefers the PlayStation 4 console, along with video game genres like action, shooters, and sports. They also tend to favor games with a Mature (M) rating. This is very similar to North America, but in Europe, the preference for the PS4 over the Xbox One is significantly greater.

Japan is build diferent, in this region the audience prefers de Nintendo 3DS, along with video games genres Role-playing, action, and fighting games. They also tend to favor games with a Teen (T) rating.

Step 5: hypotesis¶

Comentario del revisor. (Iteración 1)

Tampoco olvides, que las librearías a usar suelen colocarse al inicio del notebook, de esta forma tienes un proyecto con una estructura bien establecida.

In [76]:
# I avoid scores with 0, since they are missing values.
df_scores = df.loc[df["user_score"] != 0]

# Xbox user score.
xbox_user_score = df_scores[df_scores['platform'] == 'XONE']['user_score'].reset_index(drop=True)

# PC user score.
pc_user_score = df_scores[df_scores['platform'] == 'PC']['user_score'].reset_index(drop=True)
User score for Xbox and PC¶

Nule hypotesis: The mean user score is statistically the same for Xbox One and PC.

Alternate hypotesis: The mean user score is statistacally diferent for Xbox One and PC.

In [77]:
# The mean user score for the Xbox One and PC platforms are the same.
# Mean Scores
print(f"Mean user score for Xbox is {xbox_user_score.mean().round(2)}",
      f"\nMean user score for PC is {pc_user_score.mean().round(2)}")
Mean user score for Xbox is 6.52 
Mean user score for PC is 7.07

T Test 1

In [78]:
# significance level ("critical value")
alpha = .05

# Test with diferent variance, for two independent statistical population.
results = st.ttest_ind(xbox_user_score, pc_user_score, equal_var = False)

print('P Value:', results.pvalue)

if results.pvalue < alpha:
    print("We reject the null hypothesis.")
else:
    print("We fail to reject the null hypothesis.")
P Value: 4.5385802963771835e-06
We reject the null hypothesis.
Comentario del revisor. (Iteración 1)

Muy bien!! Realizaste correcatamente el test al igual que la interpretación del pvalue.

User Score for Action and Sport¶

Nule hypotesis: The mean user score is statistically the same for Action and Sport video games.

Alternate hypotesis: The mean user score is statistacally diferent for Action and Sport video games.

In [79]:
# Action games user score
action_user_score = df_scores[df_scores['genre'] == 'action']['user_score'].reset_index(drop=True)

# Spot games user score
sport_user_score = df_scores[df_scores['genre'] == 'sports']['user_score'].reset_index(drop=True)
In [80]:
# Is the mean user score statistaclly diferent for Action and Sport genres?
# Mean Scores
print(f"Mean user score for Action games is {action_user_score.mean().round(2)}",
      f"\nMean user score for Sport games is {sport_user_score.mean().round(2)}")
Mean user score for Action games is 7.06 
Mean user score for Sport games is 6.95

T Test 2

In [81]:
# significance level ("critical value")
alpha = .05

# Test with diferent variance, for two independent statistical population.
results = st.ttest_ind(action_user_score, sport_user_score, equal_var = False)

print('P Value:', results.pvalue)

if results.pvalue < alpha:
    print("We reject the null hypothesis.")
else:
    print("We fail to reject the null hypothesis.")
P Value: 0.07370217419278526
We fail to reject the null hypothesis.

Step 6: Conclusion¶

Español

A continuación enunciaré los hallazgos de esta investigación:

En este análisis descubrí que PS2, X360, PS3, WII, DS, y PS son las plataformas más exitosas en la historia de los videojuegos (hasta 2016).

Las plataformas tienden a tener un periodo de vida de 9 a 10 años. Entre los años 2007 al 2011 fue el momento que más títulos de videojuegos fueron lanzados al mercado.

El PS4, el Xbox One y el nintendo 3DS son las consolas más exitosas a la fecha de esta investigación, por sus ventas y por estar vigentes en el mercado.

Existe una ligera correlación entre las crítica de los expertos y las ventas de los videojuegos, asimismo hay ligera correlación entre la calificiación de los usuarios y las ventas.

Los mismos títulos tienen mejores ventas en consolas como PS4 o Xbox One en comparación con su contraparte para PC. De igual manera algunos de los grandes títulos lanzados para NES en los años 80, como lo son "Super Mario bros" y el legendario "Tetris", se volvieron a lanzar para Game Boy (GB) hacia finales de los años 90, y si bien estos relanzamientos lograron algunas ventas, sus números están muy por debajo de lo que lograron en NES más de una década antes.

Definitivamente la categoría más lucrativa es "Action", y la menos lucrativa es "Strategy", sin embargo vale la pena recalcar que la mayoría de los juegos son etiquetados como "Action" por lo cual puede no ser muy acertado pensar que los juegos "Action" tienden a ser mejor recibidos, sino que al ser la categoría más predominante es la categoría que más dinero ha generado.

Sobre los perfiles de usuarios por región puedo concluir lo siguiente: El perfil para Estados Unidos es PlayStation 4 o Xbox One casi a la par, con gustos en juegos clasificación M y del genero acción, shooter, o deportes.

Europa es muy similar a Estados unidos pero la predilecció en plataforma para Europa es más fuerte hacia el PlayStation 4, también prefieren los juegos clasificación M (adultos) y de los generos acción, shooter, o deportes.

Japón es el más distinto, en esta región los jugadores prefieren la consola nintentdo 3DS, quizá por ser orginaria de ese país; prefieren los juegos clasificación T (adolescentes) y también hay una diferencia en la preferencia de los géneros, pues ellos prefieren en primer lugar los Role-playing games, seguido de los juegos de acción y los juegos de pelea.

Por último, puse a prueba la hipótesis nula "las calificaciones promedio de los usuarios para las plataformas Xbox One y PC son las mismas", y descubrí que sí existe una diferencia estadística, es muy poco probable que haya sido una variación de los datos, podemos recharzar la hipótesis nula, y tomar la alternariva indicando que son difentes poblaciones.

por el contrario cuando revisé la hipótesis "las calificaciones promedio de los usuarions para los géneros de Acción y Deportes son iguales", si establezco el umbral (alpha) en .05, como es convencional, la Prueba T me indica que no podemos rechazar la hipótesis nula, por lo tanto sugiere que estadísticamente son iguales estas poblaciones; sin embargo, el P. Value de esta prueba es muy cercano al umbral (.07), si yo moviera el alfa en vez del 5% al 10% entonces la prueba me sugeriría recharzar la hipótesis nula.

English

Below, I will outline the findings of this research:

In this analysis, I discovered that the PS2, X360, PS3, Wii, DS, and PS are the most successful platforms in video game history (up to 2016).

Platforms tend to have a lifespan of 9 to 10 years. Between 2007 and 2011 was the period when the most video game titles were released to the market.

The PS4, Xbox One, and Nintendo 3DS are the most successful consoles at the time of this research, due to their sales and continued presence in the market.

There is a slight correlation between expert reviews and video game sales. Similarly, there is a slight correlation between user ratings and sales.

The same titles tend to sell better on consoles like PS4 or Xbox One compared to their PC counterparts. Similarly, some of the great titles released for the NES in the 1980s, such as Super Mario Bros. and the legendary Tetris, were re-released for the Game Boy (GB) in the late 1990s. While these re-releases achieved some sales, their numbers are far below what they achieved on the NES more than a decade earlier.

The most lucrative category is definitely "Action," while the least lucrative is "Strategy." However, it is worth noting that most games are labeled as "Action," which makes it inaccurate to assume that "Action" games tend to be better received. Instead, being the most predominant category, it is the one that has generated the most revenue.

Regarding user profiles by region, I can conclude the following:

The profile for the United States is almost evenly split between PlayStation 4 and Xbox One, with preferences for games rated M and genres such as action, shooters, or sports. Europe is very similar to the United States, but the preference for PlayStation 4 is much stronger in this region. They also prefer M-rated games and genres like action, shooters, or sports. Japan is the most distinct. In this region, players prefer the Nintendo 3DS console, perhaps because it originates from that country. They favor T-rated games (for teens), and their genre preferences differ as well, with Role-Playing Games being the most popular, followed by action and fighting games. Finally, I tested the null hypothesis: "The average user scores for Xbox One and PC are the same," and I found that there is a statistically significant difference. It is highly unlikely that this difference is due to data variation. We can reject the null hypothesis and accept the alternative, indicating that they are different populations.

On the other hand, when I tested the hypothesis, "The average user scores for Action and Sports genres are the same," if I set the threshold (alpha) at .05, as is conventional, the T-test indicates that we cannot reject the null hypothesis. This suggests that these populations are statistically the same. However, the p-value of this test is very close to the threshold (.07). If I were to move alpha from 5% to 10%, the test would suggest rejecting the null hypothesis.

Comentario del revisor. (Iteración 1)

Muy bien! Excelentes conclusiones!